Calibration of Speech Data Acquisition Path 



Field of Invention 

This invention relates to speech recognition and more particularly to calibration of the speech
data acquisition path.

Background of Invention 

A speech acquisition path refers to the whole speech transmission path before the speech
is actually digitized.

A typical speech acquisition path therefore includes the air from the lips to the microphone, the
microphone, wires, antialiasing filters, and the analog-to-digital converter. Together these determine the
transfer function of the system. Noise can be introduced at each of these devices and from
the power supply of the analog-to-digital converter.

Practical speech acquisition lines, especially for low cost devices, introduce both
convolutive and additive noises to the input speech, and cause additional statistical mismatch
between an utterance to be recognized and the trained speech model set. Such mismatch
degrades recognition performance.

Previously, SNR-dependent cepstral normalization (SDCN), fixed-code-word-dependent
cepstral normalization (FCDCN) [See A. Acero. Acoustical and Environmental Robustness in
Automatic Speech Recognition. Kluwer Academic Publishers, 1993], multi-variate Gaussian
based cepstral normalization [P. Moreno, B. Raj, and R. Stern. Multi-variate Gaussian based
cepstral normalization. In Proc. of IEEE International Conf. on Acoustics, Speech and Signal
Processing, Detroit, 1995] and statistical re-estimation [P. Moreno, B. Raj, and R. Stern. A
unified approach to robust speech recognition. In Proceedings of European Conference on
Speech Communication and Technology, Madrid, Spain, Sept. 1995] have been proposed to deal
with similar problems. They all assume that the distortions can be modeled by a bias in the
cepstral domain, which is clearly not the case for additive distortions. Vector Taylor series have
been used to approximate the distortion as a function of the cepstral representation of the additive and
convolutive noises. See P. J. Moreno, B. Raj, and R. M. Stern. A vector Taylor series
approach for environment-independent speech recognition. In Proc. of IEEE International Conf. on
Acoustics, Speech and Signal Processing, Atlanta, GA, 1996.

Summary of Invention 

In accordance with one embodiment of the present invention, the parameters of the
convolutive and additive noise are determined by a calibration of the speech acquisition path to
compensate for the mismatch between an utterance to be recognized and a trained speech model
set. In the power spectrum domain both types of noise are modeled as polynomial functions of
frequency. The model parameters are estimated using the maximum likelihood (ML) criterion on a
set of simultaneous recordings.

Description of Drawings 

Fig. 1 illustrates the equipment configuration for the calibration according to one
embodiment of the present invention;

Fig. 2 illustrates the process steps for the calibration;

Fig. 3 illustrates the results of convolutive noise estimation at 30 dB SNR;

Fig. 4 illustrates the results of additive noise estimation at 30 dB SNR;

Fig. 5 illustrates the results of convolutive noise estimation at 24 dB SNR;

Fig. 6 illustrates the results of additive noise estimation at 24 dB SNR;

Fig. 7 illustrates that the estimation of the convolutive bias with the independent component
model gives 4.7 to 8.4 times larger estimation error than with polynomial models; and

Fig. 8 illustrates that the estimation of the additive bias with the independent component model
gives 7.5 to 12.3 times larger estimation error than with polynomial models.

Description of Preferred Embodiments



In the power-spectral domain, the additive and convolutive noises are modeled as
polynomial functions of frequency. The model parameters are estimated with the Maximum
Likelihood (ML) criterion, on a set of simultaneous recordings.

Once the parameters of the convolutive and additive noises are determined, speech
recognizers can be compensated with these parameters, either by speech enhancement or by
model adaptation. See references: M. J. F. Gales, "nice" model-based compensation schemes
for robust speech recognition. In Robust speech recognition for unknown communication
channels, pages 55-64, Pont-à-Mousson, France, 1997; Y. Gong. Speech recognition in noisy
environments: A survey. Speech Communication, 16(3):261-291, April 1995; and C.H. Lee. On
feature and model compensation approach to robust speech recognition. In Robust speech
recognition for unknown communication channels, pages 45-54, Pont-à-Mousson, France, 1997.

The present invention determines the transfer function and the associated noise estimate.
What we want to determine is the frequency response of the microphone as well as the noise that
may have been introduced by the A/D (analog-to-digital) conversion process. This is particularly
needed for low quality microphones and noisy systems. To model the acquisition path we need
to determine H, the linear filter representing the microphone and the stages that follow it, and
N, the noise of the A/D system. The hardware illustrated in Fig. 1 is used.

The equipment 10 used is outlined in Fig. 1. It produces two signals: reference $Y_R$ and
noisy $Y_N$. $Y_R$ is assumed to give a digitized speech signal under the same acoustic environment as
the speech used to train the recognizer and is represented by microphone 11 and DAT (high
quality data recording) equipment 15. For this reference H is 1 and N is zero. $Y_N$ is the noisy
speech signal under the acoustic environment of the target application with the test microphone
13 and equipment under test path 17. Reference microphone 11 is the one used for recording the
training database, and test microphone 13 is the one to be used in the target product for field
operation. The equipment test path 17 may introduce both convolutive noises (microphone, pre-A/D
(pre analog-to-digital) filter) and additive noises (environmental noises, any noises
introduced before A/D, etc.), even when there is no background noise. The purpose here is to
estimate the two noise components from very short speech material (typically one utterance),
represented by X.

In the equipment configuration of Fig. 1, we identify two environments: reference (R) and
noisy (N). We represent a signal by its power spectral density sampled with a DFT (Discrete
Fourier Transform) of 2M dimensions. Using $Y_R$ and $Y_N$, the distortion parameters are
determined. For a given signal X, the reference signal is

$$Y_R = H_R X \tag{1}$$

and the noisy signal is

$$Y_N = H_N X + N_N + e \tag{2}$$

where e is assumed to follow a zero-mean Gaussian distribution with diagonal covariance matrix R, i.e.,
$e \sim \mathcal{N}(0, R)$.

As the pre-A/D filtering 15a as well as other parts of the reference equipment are of much
higher acoustic quality than that of most speech recognizers, it is assumed that $H_R$ contains the
information on the reference microphone 11. $H_N$ models the test microphone 13 and the pre-A/D
filter 17a of the equipment test path. $N_N$ models the background noise introduced at any point of
the test equipment. $H_R$ and $H_N$ are both M-dimensional diagonal matrices, and $N_N$ is an
M-dimensional vector. One can remove X by substituting $H_R^{-1} Y_R$ for X.
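The power-spectral representation used by the model can be illustrated with a minimal Python sketch. It assumes each frame is zero-padded or truncated to 2M samples and that only bins k = 0..M-1 are kept; the function name `power_spectrum` and these framing choices are assumptions made for illustration, not details given in the text.

```python
import cmath
import math

def power_spectrum(frame, M):
    """Power spectrum of one frame from a 2M-point DFT, keeping bins 0..M-1."""
    n_fft = 2 * M
    # Zero-pad or truncate the frame to exactly 2M samples (an assumption).
    x = list(frame[:n_fft]) + [0.0] * max(0, n_fft - len(frame))
    spectrum = []
    for k in range(M):
        acc = 0j
        for n, sample in enumerate(x):
            acc += sample * cmath.exp(-2j * math.pi * k * n / n_fft)
        spectrum.append(abs(acc) ** 2)
    return spectrum

# A cosine aligned with bin 3 of a 16-point DFT (M = 8) puts its power in bin 3.
M = 8
tone = [math.cos(2 * math.pi * 3 * n / (2 * M)) for n in range(2 * M)]
ps = power_spectrum(tone, M)
peak_bin = max(range(M), key=lambda k: ps[k])
```

Applying such a function to time-aligned frames of the reference and test recordings would give the vectors written $Y_R(i)$ and $Y_N(i)$ in the equations.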



From Equation 1 and Equation 2, we have:

$$Y_N = H_\Delta Y_R + N_N + e \tag{3}$$

with

$$H_\Delta \triangleq H_N H_R^{-1} \quad \text{(ratio of transfer functions)} \tag{4}$$

We are interested in estimating the acoustic changes represented by $H_\Delta$ and $N_N$. Many
combinations of these values could explain the observations. The model assumed herein is that $Y_N$
follows a Gaussian distribution, as in Equation 5.

Let the i-th observed $Y_N$ be $Y_N(i)$, and the i-th observed $Y_R$ be $Y_R(i)$. The likelihood of a set of T
observed noisy signals is then:

$$p(Y_N(1), Y_N(2), \ldots, Y_N(T) \mid \lambda) = \prod_{i=1}^{T} \frac{1}{(2\pi)^{M/2} |R|^{1/2}} \exp\left(-\frac{1}{2}\,\bigl(Y_N(i) - H_\Delta Y_R(i) - N_N\bigr)'\, R^{-1}\, \bigl(Y_N(i) - H_\Delta Y_R(i) - N_N\bigr)\right) \tag{5}$$

where $\lambda$ denotes the set of model parameters.



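With a limited amount of data, direct estimation of the parameters $H_\Delta \in \mathbb{R}^M$ and $N_N \in \mathbb{R}^M$ of the
model may give unreliable results in noisy conditions. We propose to further restrict the solutions
for the two sets of parameters to the spaces spanned by polynomial functions.

The further modeling to constrain the values of $H_\Delta$ uses the polynomial model. We
assume that $H_\Delta$ varies smoothly rather than abruptly with frequency, so that it can be described
by a polynomial of low order. We assume the noise follows a Gaussian distribution.
Let $k \in [0, M)$ be the frequency index, and

$$\nu(k) \triangleq \frac{k}{M} \in [0, 1] \tag{6}$$

be the normalized frequency. In order to reduce the number of free parameters and improve the
robustness of the estimation, we further assume that $H_\Delta$ is an order-P polynomial function of
normalized frequency $\nu$:

$$H_\Delta \triangleq [\,b_H(1)'\theta_H,\ b_H(2)'\theta_H,\ \ldots,\ b_H(k)'\theta_H,\ \ldots,\ b_H(M)'\theta_H\,]' \tag{7}$$

where

$$\theta_H \triangleq [\,\theta_H^{(1)}, \theta_H^{(2)}, \ldots, \theta_H^{(P)}\,]' \tag{8}$$

$$b_H(k) \triangleq [\,1, \nu(k), \nu^{2}(k), \ldots, \nu^{P-1}(k)\,]' \tag{9}$$

Similarly, we assume that $N_N$ is an order-Q polynomial function of normalized frequency:

$$N_N \triangleq [\,b_N(1)'\theta_N,\ b_N(2)'\theta_N,\ \ldots,\ b_N(k)'\theta_N,\ \ldots,\ b_N(M)'\theta_N\,]' \tag{10}$$

where

$$\theta_N \triangleq [\,\theta_N^{(1)}, \theta_N^{(2)}, \ldots, \theta_N^{(Q)}\,]' \tag{11}$$

$$b_N(k) \triangleq [\,1, \nu(k), \nu^{2}(k), \ldots, \nu^{Q-1}(k)\,]' \tag{12}$$

The model parameter set is then:

$$\lambda \triangleq \{\theta_H, \theta_N, R\} \tag{13}$$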
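The polynomial constraint can be made concrete with a short Python sketch that evaluates a coefficient vector on the grid of normalized frequencies. The helper names `basis` and `evaluate`, the 0-based bin indexing (following $k \in [0, M)$), and the coefficient values are assumptions chosen for illustration only.

```python
def basis(k, M, order):
    """Basis vector [1, v, v^2, ..., v^(order-1)] at normalized frequency v = k / M."""
    v = k / M
    return [v ** j for j in range(order)]

def evaluate(theta, M):
    """Evaluate the polynomial with coefficients theta at every bin k = 0..M-1."""
    order = len(theta)
    return [sum(b * t for b, t in zip(basis(k, M, order), theta))
            for k in range(M)]

# Hypothetical order-3 coefficients for the transfer-function ratio.
theta_H = [1.0, -0.5, 0.25]
H_delta = evaluate(theta_H, M=4)
```

With P coefficients the model has P free parameters for M frequency bins, which is the reduction in free parameters the polynomial assumption is meant to provide.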
Determination of parameters:

Changing $H_\Delta$ and $N_N$ changes the shape of the Gaussian distribution. We adjust the
shape so that the probability of observing $Y_N$ is maximized, as represented by Equations 14
and 15.

Polynomial coefficients:

Using the maximum likelihood criterion to determine the parameter set $\lambda$, we have

$$\hat{\lambda} = \arg\max_{\lambda}\ p(Y_N(1), Y_N(2), \ldots, Y_N(T) \mid \lambda)$$

with

$$\frac{\partial}{\partial \theta_H^{(p)}} \log p(Y_N(1), Y_N(2), \ldots, Y_N(T) \mid \lambda) = 0, \quad p = 1, \ldots, P$$

$$\frac{\partial}{\partial \theta_N^{(q)}} \log p(Y_N(1), Y_N(2), \ldots, Y_N(T) \mid \lambda) = 0, \quad q = 1, \ldots, Q \tag{17}$$



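By interchanging the summations, Equation 14 and Equation 15 can be rewritten as:

$$\sum_{j=1}^{P} \theta_H^{(j)} \sum_{k=1}^{M} \nu^{p+j-2}(k) \sum_{i=1}^{T} Y_R^{k}(i)\,Y_R^{k}(i) + \sum_{j=1}^{Q} \theta_N^{(j)} \sum_{k=1}^{M} \nu^{p+j-2}(k) \sum_{i=1}^{T} Y_R^{k}(i) = \sum_{k=1}^{M} \nu^{p-1}(k) \sum_{i=1}^{T} Y_N^{k}(i)\,Y_R^{k}(i) \tag{19}$$

$$\sum_{j=1}^{P} \theta_H^{(j)} \sum_{k=1}^{M} \nu^{q+j-2}(k) \sum_{i=1}^{T} Y_R^{k}(i) + T \sum_{j=1}^{Q} \theta_N^{(j)} \sum_{k=1}^{M} \nu^{q+j-2}(k) = \sum_{k=1}^{M} \nu^{q-1}(k) \sum_{i=1}^{T} Y_N^{k}(i) \tag{20}$$

where $Y^{k}(i)$ denotes the k-th frequency component of the i-th observation.

Denote:

$$\alpha(m, f, g) \triangleq \sum_{k=1}^{M} \nu^{m}(k) \sum_{i=1}^{T} f^{k}(i)\, g^{k}(i) \tag{21}$$

$$\beta(m, f) \triangleq \sum_{k=1}^{M} \nu^{m}(k) \sum_{i=1}^{T} f^{k}(i) \tag{22}$$

$$\gamma(m) \triangleq T \sum_{k=1}^{M} \nu^{m}(k) \tag{23}$$

and

$$A_p \triangleq [\,A_p^{(1)}, A_p^{(2)}, \ldots, A_p^{(P)}\,]' \quad \text{with} \quad A_p^{(j)} = \alpha(p+j-2, Y_R, Y_R) \tag{24}$$

$$B_p \triangleq [\,B_p^{(1)}, B_p^{(2)}, \ldots, B_p^{(Q)}\,]' \quad \text{with} \quad B_p^{(j)} = \beta(p+j-2, Y_R) \tag{25}$$

$$D_q \triangleq [\,D_q^{(1)}, D_q^{(2)}, \ldots, D_q^{(Q)}\,]' \quad \text{with} \quad D_q^{(j)} = \gamma(q+j-2) \tag{26}$$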



$$u \triangleq [\,u_1, u_2, \ldots, u_P\,]' \quad \text{with} \quad u_p = \alpha(p-1, Y_R, Y_N) \tag{27}$$

$$v \triangleq [\,v_1, v_2, \ldots, v_Q\,]' \quad \text{with} \quad v_q = \beta(q-1, Y_N) \tag{28}$$



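Equation 19 and Equation 20 can be expressed as a linear system in which the H and N parameters are
the variables, as follows:

$$\sum_{j=1}^{P} A_p^{(j)}\, \theta_H^{(j)} + \sum_{j=1}^{Q} B_p^{(j)}\, \theta_N^{(j)} = u_p \tag{29}$$

$$\sum_{j=1}^{P} C_q^{(j)}\, \theta_H^{(j)} + \sum_{j=1}^{Q} D_q^{(j)}\, \theta_N^{(j)} = v_q \tag{30}$$

Or, equivalently:

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} \begin{bmatrix} \theta_H \\ \theta_N \end{bmatrix} = \begin{bmatrix} u \\ v \end{bmatrix} \tag{31}$$

where

$$A_{(P \times P)} \triangleq [\,A_1, A_2, \ldots, A_P\,]' \tag{32}$$

$$B_{(P \times Q)} \triangleq [\,B_1, B_2, \ldots, B_P\,]' \tag{33}$$

$$C_{(Q \times P)} \triangleq [\,C_1, C_2, \ldots, C_Q\,]' = B' \tag{34}$$

$$D_{(Q \times Q)} \triangleq [\,D_1, D_2, \ldots, D_Q\,]' \tag{35}$$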

Equation 31 is a linear system of P + Q variables, and can be solved by a general linear system
solution method. See W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling.
Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press,
1988.
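As an illustration of the joint solution, the following Python sketch accumulates the sums that make up A, B, D, u and v from frames of power spectra and solves the resulting system of P + Q unknowns by Gaussian elimination. It is a sketch under stated assumptions, not the reference implementation: the function names `solve`, `calibrate` and `moment`, the 0-based bin indexing, and the synthetic coefficients are all hypothetical, and the residual variance is treated as equal across frequency bins, as the unweighted sums suggest.

```python
import random

def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[r]] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def calibrate(Y_R, Y_N, P, Q):
    """Return (theta_H, theta_N) from T frames of M-bin power spectra."""
    T, M = len(Y_R), len(Y_R[0])
    nu = [k / M for k in range(M)]  # normalized frequency, 0-based bins (assumption)
    # Per-bin sums over the T frames.
    s_rr = [sum(Y_R[i][k] * Y_R[i][k] for i in range(T)) for k in range(M)]
    s_rn = [sum(Y_R[i][k] * Y_N[i][k] for i in range(T)) for k in range(M)]
    s_r = [sum(Y_R[i][k] for i in range(T)) for k in range(M)]
    s_n = [sum(Y_N[i][k] for i in range(T)) for k in range(M)]

    def moment(m, per_bin):
        # sum over k of nu(k)^m * per_bin[k]; covers the alpha, beta and gamma sums
        return sum(nu[k] ** m * per_bin[k] for k in range(M))

    # With 0-based p and j, the exponent p + j equals the 1-based p + j - 2.
    A = [[moment(p + j, s_rr) for j in range(P)] for p in range(P)]
    B = [[moment(p + j, s_r) for j in range(Q)] for p in range(P)]
    D = [[moment(q + j, [T] * M) for j in range(Q)] for q in range(Q)]
    u = [moment(p, s_rn) for p in range(P)]
    v = [moment(q, s_n) for q in range(Q)]
    # Assemble [[A, B], [B', D]] and the right-hand side [u; v].
    top = [A[p] + B[p] for p in range(P)]
    bottom = [[B[p][q] for p in range(P)] + D[q] for q in range(Q)]
    theta = solve(top + bottom, u + v)
    return theta[:P], theta[P:]

# Noise-free synthetic check with hypothetical polynomial coefficients.
random.seed(0)
M, T = 16, 5
true_H, true_N = [1.0, 0.5], [2.0, -1.0, 1.0]
Y_R = [[random.uniform(1.0, 10.0) for _ in range(M)] for _ in range(T)]
Y_N = [[sum(c * (k / M) ** j for j, c in enumerate(true_H)) * Y_R[i][k]
        + sum(c * (k / M) ** j for j, c in enumerate(true_N))
        for k in range(M)] for i in range(T)]
theta_H, theta_N = calibrate(Y_R, Y_N, P=2, Q=3)
```

In the noise-free case the recovered coefficients should match the ones used to synthesize the data up to rounding error; with real recordings the estimates are least-squares fits.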

Solving linear system equations:

Alternatively, a more efficient solution can be used, which solves one linear system of order P
and another of order Q, rather than jointly solving a linear system with (P+Q) variables.

From Equation 31 and Equation 34, we have the following block linear equations:

$$A\,\theta_H + B\,\theta_N = u \tag{36}$$

$$B'\,\theta_H + D\,\theta_N = v \tag{37}$$

From Equation 36,

$$\theta_H = A^{-1}(u - B\,\theta_N) \tag{38}$$

From Equation 37 and Equation 38 we obtain a linear system of equations on $\theta_N$:

$$(D - B'A^{-1}B)\,\theta_N = v - B'A^{-1}u \tag{39}$$

Solving Equation 31 can therefore be achieved by first solving Equation 39 for $\theta_N$ and then using
Equation 38 for $\theta_H$. Similarly, for $\theta_H$, we can derive, from Equation 37,

$$\theta_N = D^{-1}(v - B'\theta_H) \tag{40}$$

From Equation 36 and Equation 40 we obtain a linear system of equations on $\theta_H$:

$$(A - BD^{-1}B')\,\theta_H = u - BD^{-1}v \tag{41}$$

Solving Equation 31 can therefore also be achieved by first solving Equation 41 for $\theta_H$ and then
using Equation 40 for $\theta_N$.

Depending on the order of the polynomials, one of the two solutions could be computationally more
efficient than the other. Finally, we point out that A and D are symmetric and
contain only positive elements. That property can be exploited for more efficient solutions.
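The first of the two block routes (Equation 39 followed by Equation 38) can be sketched in Python as follows. The sketch assumes A and the reduced Q x Q matrix are non-singular; the function names and the small numeric matrices are hypothetical values picked so that the expected answer is known in advance.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [rhs[r]] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def solve_by_blocks(A, B, D, u, v):
    """Solve for theta_N from the reduced system, then back-substitute for theta_H."""
    P, Q = len(A), len(D)
    # Columns of A^{-1} B and the vector A^{-1} u, each from one P x P solve.
    ainv_b_cols = [solve(A, [B[p][q] for p in range(P)]) for q in range(Q)]
    ainv_u = solve(A, u)
    # Reduced matrix D - B' A^{-1} B and right-hand side v - B' A^{-1} u.
    reduced = [[D[r][c] - sum(B[p][r] * ainv_b_cols[c][p] for p in range(P))
                for c in range(Q)] for r in range(Q)]
    rhs = [v[r] - sum(B[p][r] * ainv_u[p] for p in range(P)) for r in range(Q)]
    theta_N = solve(reduced, rhs)
    # Back-substitution: theta_H = A^{-1} (u - B theta_N).
    theta_H = solve(A, [u[p] - sum(B[p][q] * theta_N[q] for q in range(Q))
                        for p in range(P)])
    return theta_H, theta_N

# Hypothetical blocks built so that theta_H = [1, 2] and theta_N = [3, -1].
A = [[4.0, 1.0], [1.0, 3.0]]
B = [[1.0, 2.0], [0.0, 1.0]]
D = [[5.0, 1.0], [1.0, 4.0]]
u = [7.0, 6.0]
v = [15.0, 3.0]
theta_H, theta_N = solve_by_blocks(A, B, D, u, v)
```

The second route (Equation 41 followed by Equation 40) is symmetric: it swaps the roles of A and D, so the smaller of P and Q can decide which block is eliminated first.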

Covariance matrix:

To solve for the coefficients of the covariance matrix, we make use of the two equalities:

$$\frac{\partial}{\partial A} \log |A| = A^{-1} \tag{42}$$

$$\frac{\partial}{\partial A}\, a'Aa = aa' \tag{43}$$

To calculate the covariance matrix, we set the derivative of the logarithm of Equation 5 with
respect to $R^{-1}$ to zero, which gives:

$$R = \frac{1}{T} \sum_{i=1}^{T} \bigl(Y_N(i) - H_\Delta Y_R(i) - N_N\bigr)\bigl(Y_N(i) - H_\Delta Y_R(i) - N_N\bigr)' \tag{44}$$

To quantify the goodness of fit between the model and the data, we use:

$$e = \operatorname{Trace}(R) \tag{45}$$
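A small Python sketch of the goodness-of-fit computation is given below. Because the covariance is assumed diagonal, only the per-bin residual variances (the diagonal of Equation 44) are computed, and their sum gives the trace of Equation 45. The function names and the toy numbers are hypothetical.

```python
def residual_variances(Y_R, Y_N, H_delta, N_N):
    """Diagonal of the residual covariance: mean squared residual per frequency bin."""
    T, M = len(Y_R), len(Y_R[0])
    variances = []
    for k in range(M):
        squared = sum((Y_N[i][k] - H_delta[k] * Y_R[i][k] - N_N[k]) ** 2
                      for i in range(T))
        variances.append(squared / T)
    return variances

def fit_error(variances):
    """Trace of the diagonal covariance: the sum of the per-bin variances."""
    return sum(variances)

# Toy data: two frames, two bins, hypothetical estimates of H_delta and N_N.
Y_R = [[1.0, 2.0], [3.0, 4.0]]
Y_N = [[3.5, 1.0], [6.5, 2.5]]
H_delta = [2.0, 0.5]
N_N = [1.0, 0.0]
variances = residual_variances(Y_R, Y_N, H_delta, N_N)
error = fit_error(variances)
```

A smaller trace indicates that the fitted polynomial models explain more of the difference between the test and reference spectra.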



Referring to Fig. 2, there is illustrated the process steps for calibration. A voice utterance
is applied to a high quality path and a test path as illustrated in Fig. 1 (Step 1), and in Step 2 a
measurement is made for each frame. For each frame of each utterance i, the power spectrum is
determined for reference $Y_R$ and test $Y_N$. After all frames are measured (Step 3), for each utterance
Equations 32-35 and Equations 27 and 28 are calculated for A, B, C, D, u and v (Step 4). In Step 5 the noise
estimate $\theta_N$ and the channel estimate $\theta_H$ are calculated using Equation 31, or the noise estimate $\theta_N$
is calculated using Equation 39 and the channel estimate $\theta_H$ is calculated using Equation 38.

Referring to the test equipment 10 of Fig. 1, the outputs $Y_R$ and $Y_N$ are applied to
processor 19, which processes the signals according to Fig. 2 described above to produce the
channel estimate $\theta_H$ and the noise estimate $\theta_N$ signal outputs. These outputs may be displayed on
a display or directly applied to modify acoustic models in a recognizer. For example, the test
equipment described in Fig. 1 may be used to test a cellular phone in a car. After a test utterance
in the car using the test calibration unit described above, the reference signal and the test signal
are processed and the equipment provides a channel estimate and a noise estimate, either on the
display for the user to manually enter the findings in the speech recognizer of the car, or
directly to the recognizer.

To show that the method works, we synthesize such an environment. The goal is to
recover the distortion: the system has to recover H and N from the observed $Y_N$. One is the
synthesized distortion and the other is the recovered one.

To simulate a distortion, the test speech signal $y_R(n)$ is modified in the time domain according to the
following operation:

$$y_N(n) = y_R(n) * h_\Delta(n) + n_N(n) \tag{46}$$

The speech signal in the example is an utterance of the digit string two five eight nine six
oh four (5.6 seconds). $h_\Delta(n)$ is a five-coefficient FIR band-pass filter with cutoff frequencies at
0.10 and 0.35 and 26 dB attenuation. $n_N(n)$ is a computer-generated white Gaussian noise
sequence.
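The time-domain distortion of Equation 46 can be sketched in Python as an FIR convolution followed by additive white Gaussian noise. The filter taps below are hypothetical placeholders, not the band-pass coefficients used in the experiment (which the text does not list), and the function names and seed are likewise assumptions.

```python
import random

def convolve(signal, taps):
    """Causal FIR filtering: out[n] = sum over j of taps[j] * signal[n - j]."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            if n - j >= 0:
                acc += h * signal[n - j]
        out.append(acc)
    return out

def distort(y_R, h_delta, noise_std, seed=0):
    """Filter the reference signal and add white Gaussian noise."""
    rng = random.Random(seed)
    filtered = convolve(y_R, h_delta)
    return [s + rng.gauss(0.0, noise_std) for s in filtered]

impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
taps = [0.1, 0.2, 0.4, 0.2, 0.1]  # hypothetical five-coefficient filter
clean = distort(impulse, taps, noise_std=0.0)   # reproduces the impulse response
noisy = distort(impulse, taps, noise_std=0.01)  # same, with additive noise
```

Varying `noise_std` relative to the signal power is one way to sweep the SNR of the simulated test channel.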

In a speech signal, the energy is concentrated at low frequencies, which yields higher
polynomial fitting error at high frequencies. To balance the errors, the speech signal is
pre-emphasized. As pre-emphasizing the speech signal is a common practice in speech recognition,
$H_\Delta$ and $N_N$ estimated using pre-emphasized speech could be directly usable for compensation.
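Pre-emphasis is commonly implemented as a first-order difference filter. The Python sketch below assumes that form and a coefficient of 0.97, a typical value in speech front ends; neither the filter form nor the coefficient is specified in the text.

```python
def pre_emphasize(samples, coefficient=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coefficient * x[n - 1]."""
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor and is kept as is
    for n in range(1, len(samples)):
        out.append(samples[n] - coefficient * samples[n - 1])
    return out

# A constant (low-frequency) signal is strongly attenuated after the first sample.
flat = pre_emphasize([1.0, 1.0, 1.0], coefficient=0.5)
```

The filter boosts high frequencies relative to low ones, which flattens the speech spectrum before the polynomial fit.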

Throughout the experiments reported,

• we used 9th order polynomials for convolutive noises (P = 9), and 6th order
polynomials for additive noises (Q = 6).

• Noise estimates shown in the figures (labeled as "noise.PW") are obtained by averaging 30
seconds of noise power spectra.

• In the figures, "C" stands for convolutive noise, and "A" for additive noise.
"POLY" stands for an estimate obtained by polynomial models. "FILTER" stands for the
frequency response of the band-pass FIR filter.

To measure the estimation error, we use the averaged inner product of the difference between
the estimated noise and a reference:

$$e_v \triangleq \frac{1}{M}\,(v - \hat{v})'(v - \hat{v}), \quad \forall v \in \{H_\Delta, N_N\} \tag{47}$$

where $\hat{v}$ denotes the estimate of $v$.

• For convolutive noise, the reference $H_\Delta$ is the power spectrum of the filter $h_\Delta(n)$.

• For additive noise, the reference $N_N$ is the average of the power spectra of the noise
sequence.
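The error measure of Equation 47 amounts to a mean squared difference over the M frequency bins, as in the following Python sketch. The function name and the sample vectors are hypothetical.

```python
def estimation_error(estimate, reference):
    """Averaged inner product of the difference between estimate and reference."""
    M = len(reference)
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / M

# Hypothetical three-bin example: only the last bin differs, by 2.
reference = [1.0, 2.0, 5.0]
estimate = [1.0, 2.0, 3.0]
error = estimation_error(estimate, reference)
```

Computing this value for two competing estimates against the same reference and taking the ratio gives a relative comparison of the kind reported for Fig. 7 and Fig. 8.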

Fig. 3 to Fig. 6 show the results of estimation for convolutive and additive noises. In
order to test the robustness against additive noise, a white Gaussian noise is introduced to the
test channel, with variable SNR.

The results are shown for a wide SNR range, from 30 dB to -6 dB. In each figure, estimates by the
independent-component bias model and by the polynomial bias model are shown along with a
reference bias. The following can be observed:

• At 30 dB, the estimates of convolutive noise by both models are fairly close to the reference
(Fig. 3). However, for additive noise, while the polynomial model gives a good estimate, the
estimate by the independent-component bias model shows a large error with
respect to the reference (Fig. 4).

To show the relative improvement using polynomial models, Fig. 7 and Fig. 8 plot the
estimation error according to Equation 47, as a function of SNR.

• Fig. 7 shows that the estimation of convolutive bias with the independent component model
gives 4.7 to 8.4 times larger estimation error than with polynomial models.

• Fig. 8 shows that the estimation of additive bias with the independent component model gives
7.5 to 12.3 times larger estimation error than with polynomial models.